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ABSTRACT 

Eukaryotic diversity in environmental samples is 
often assessed via PCR-based amplification of 
nSSU genes. However, estimates of diversity 
derived from pyrosequencing environmental data 
sets are often inflated, mainly because of the forma- 
tion of chimeric sequences during PCR amplifica- 
tion. Chimeras are hybrid products composed of 
distinct parental sequences that can lead to the mis- 
interpretation of diversity estimates. We have 
analyzed the effect of sample richness, evenness 
and phylogenetic diversity on the formation of 
chimeras using a nSSU data set derived from 454 
Roche pyrosequencing of replicated, large control 
pools of closely and distantly related nematode 
mock communities, of known intragenomic identity 
and richness. To further investigate how chimeric 
molecules are formed, the nSSU gene secondary 
structure was analyzed in several individuals. For 
the first time in eukaryotes, chimera formation 
proved to be higher in both richer and more genet- 
ically diverse samples, thus providing a novel per- 
spective of chimera formation in pyrosequenced 
environmental data sets. Findings contribute to a 
better understanding of the nature and mechanisms 
involved in chimera formation during PCR amplifica- 
tion of environmentally derived DNA. Moreover, 
given the similarities between biodiversity analyses 
using amplicon sequencing and those used to 
assess genomic variation, our findings have poten- 
tial broad application for identifying genetic vari- 
ation in homologous loci or multigene families in 
general. 



INTRODUCTION 

Second-generation pyrosequencing of environmental 
DNA has provided unique insights into prokaryotic 
(1,2) and eukaryotic (3,4) molecular diversity and 
ecology. Massive parallel pyrosequencing has the poten- 
tial to produce a large volume of data relatively cheaply 
and with an unprecedented read depth, generating 
millions of DNA sequences within a matter of hours (5). 
Despite advantages of high throughput sequencing, 
a major challenge is to determine the extent to which 
sequences produced from pyrosequencing-amplified 
regions of marker genes correspond to biological diversity. 
Recently, studies have recognized that biodiversity levels 
have become inflated due to artifacts associated with 
sample processing including both the PCR amplification 
and the pyrosequencing itself (6-8). PCR amplification 
with universal primers applied to genes conserved across 
phyla, such as the ribosomal nuclear small subunit 
(nSSU), is commonly used to identify microbial eukary- 
otes in natural environments. The extreme conservation of 
primer binding sites (9) and the availability of extensive 
database resources (10) has resulted in the nSSU being the 
most widely used marker for studying the molecular 
taxonomy of a diverse range of eukaryotes. Target taxa 
range from all protist kingdoms (11) to metazoan micro- 
organisms (4), that are dominated by the Nematoda (12). 
In such analyses, one of the most commonly reported 
sources of sequence artifacts associated with highly hom- 
ologous nSSU genes from environmental DNA samples is 
the formation of chimeric sequences during PCR amplifi- 
cation (8,13-16). 

Chimeric sequences, or chimeras, are generated when 
incomplete extension occurs during PCR amplification 
and the resulting amplicon re-anneals to a foreign 
DNA strand and is copied to completion in the following 
PCR cycles. Chimeras are composed of two or more 
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phylogenetically distinct parental sequences and have been 
shown to occur in PCR-amplified nSSU data sets with 
frequencies of 30-70% (6,17,18) thus leading to false di- 
versity estimates and false novel taxa. The critical factors 
that seem to affect PCR-generated recombination are the 
number of PCR cycles, PCR extension time, template 
concentration, Tag DNA polymerases and amplicon size 
(18-21). Chimera formation can be minimized experimen- 
tally by PCR optimization, nonetheless, no method has 
yet proved to be entirely effective. The importance of de- 
tecting chimeras is such that a plethora of bioinformatic 
software has also been developed, such as Chimera_Check 
(22), Bellerophon (13), CCode (23), Pintail (24), Mallard 
(17), Chimera Slayer (6) and Perseus (8). With the excep- 
tion of Perseus, such approaches will only detect evident 
induced chimeras (25) and their accuracy for chimera de- 
tection has not been rigorously tested (6) or is still at an 
early stage, especially given recent advances in environ- 
mental DNA sequencing approaches. Although metage- 
netic (4,9) analyses are clearly based on complex and 
phylogenetically diverse assemblages, the roles of sample 
richness and phylogenetic diversity in driving chimera for- 
mation are largely unknown. 

Wang and Wang (18,26) tested how sequence similarity 
between cloned 16S rRNA genes or mixed bacteria 
genomic DNA can influence PCR-based chimera forma- 
tion. Nonetheless, these investigations were performed on 
a very small scale, did not consider sample richness and 
pre-dated the current second-generation sequencing per- 
spective of amplicon pool diversity. The overarching aim 
here is to (i) analyze the effect of richness, evenness and 
genetic diversity on chimera formation and link this to 
diversity estimates and (ii) understand how chimeras are 
formed with respect to variable genetic diversity and sec- 
ondary structure of the parent nSSU molecule. To this 
end, a nSSU metagenetic data set was generated by 454 
Roche pyrosequencing of control pools of closely and dis- 
tantly related nematode mock communities of known 
identity and richness. 

MATERIAL AND METHODS 

Sample preparation 

To test if chimera formation during PCR reactions was 
associated with taxon richness or with phylogenetic 
distance, 74 Sanger-sequenced single nematode species 
were blast aligned to a contemporary Nematoda phylo- 
genetic framework (27). Subsequently, the sequences were 
aligned using ClustalX and pairwise distances (p-distance) 
were calculated using MEGA 4.1 (28). Based on the 
phylogenetic affinities of the nematode sequences, 
subsets of closely related [mean pairwise divergence 
(MPD) of 25%, referred to as 'phylogenetically close'] 
and distantly related (MPD of 40%, referred to as 'phylo- 
genetically distant') nSSU controls were generated by 
pooling the DNA extracts of 12, 24 or 48 individuals. 

DNA extraction and preparation 

DNA extraction from DESS-preserved (29) single worms 
was performed using a DNeasy Blood & Tissue Kit 



(Qiagen Inc), following the manufacturer's instructions. 
After extraction, all DNA was eluted in 40 ul of AE 
buffer and samples were stored at — 20°C until use. The 
DNA extracts from all single individuals were quantified 
using a Nanodrop spectrophotometer and diluted to 
0.5ng/ul, and five replicates of the 12, 24 and 48 individ- 
uals were selected for the closely and distantly related 
treatments. 

PCR amplification and sequencing analysis 

The primers SSUF04 forward (5'-GCTTGTCTCAAAG 
ATTAAGCC-3') and SSUR22 reverse (5'-GCCTGCTGC 
CTTCCTTGGA-3') were used to amplify -450 bp of the 
nSSU rDNA (18S rDNA) region (30). Fusion primers 
were then developed according to Fonseca et al. (4). 
PCR amplification reactions and the thermocycle for the 
targeted nSSU region were optimized. Optimized reac- 
tions were performed using 0.25 ng/ul of genomic DNA 
template in 3 x 40 ul reactions using Pfu DNA polymerase 
(Promega) for each of the closely and distantly related 
nematode pools (12, 24 and 48 individuals) and all indi- 
vidual DNA extracts. PCR thermocycle conditions con- 
sisted of a 2-min denaturation step at 95°C followed by 35 
cycles (thus facilitating the generation of chimeras) 
(6,18,26) of lmin at 95°C, 45 s at 55°C, 3min at 72°C 
and a final extension of lOmin at 72° C. Negative 
controls (ultrapure water only) were included for all 
amplification reactions. Electrophoresis of triplicate PCR 
products was undertaken on a 2% gel with Top Vision™ 
LM GQ Agarose (Fermentas), and the expected 450-bp 
fragment was purified using the QIAquick Gel Extraction 
Kit (Qiagen), following the manufacturer's instruction. 
All purified PCR products were quantified with an 
Agilent Bioanalyser 2100 and diluted to the same concen- 
tration (10 ng/ul). PCR amplifications from single nema- 
todes and pooled nematodes were sequenced in a single 
direction (A-Amplicon) on a quarter and three-quarters of 
a plate, respectively, using a 454 Roche GSFLX (454 
Life Sciences, Roche Applied Science) sequencing 
platform at Liverpool University's Centre for Genomic 
Research, UK. 

Denoised reads and detection of chimeric PCR molecules 

Pyrosequencing reads derived from 454 Roche data 
contain a substantial number of errors (referred to as 
noise), which includes sequencing errors mainly derived 
from the inclusion or deletion of single bases in 
homopolymer runs of 3 bp or longer, PCR single base 
substitutions and PCR chimeras (14,31). AmpliconNoise 
was used to remove noise from the pyrosequencing data; 
this comprises filtering, flowgram and sequence clustering 
steps. It has been shown to reduce noise by ~50% in en- 
vironmental data sets (8). Subsequently, chimeras were 
identified using Perseus (8); this algorithm generates a 
Chimera Index (CI) for each read that is >0 with higher 
values corresponding to reads that are most likely to be 
chimeric. Perseus by pairwise alignments to all sequences 
of greater than or equal abundance identifies the most 
likely parent sequences of the candidate read and the 
most likely break point. Logistic regression is then used 
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to classify chimeras so that the pyrosequencing data 
output lists chimeric and non-chimeric sequences. The 
lower the probability of the sequence evolving naturally, 
the higher the CI (8). 

Perseus finds break point positions in the two parent 
sequences. To compare across the whole data set, it is 
necessary to fix these positions relative to a reference 
sequence. To do this, a four-way alignment between 
each chimeric sequence, its two parents and the 
Caenorhabditis elegans reference sequence was formed 
(GenBank/EMBL accession number EU 196001). The 
most likely break point was identified by minimizing the 
number of differences between the sections of the parents 
contributing to the chimera and the chimera itself. The 
position of each break point on the reference sequence 
was then recorded and from this, the frequency breaks 
occurring at each position could be calculated. MFold 
RNA-folding software was used to predict the potential 
role on chimera formation of the secondary structure of 
the 18S rDNA amplicon region (32). 

Generation of operational taxonomic units 

Denoised mock nematode community data from which 
chimeras had been removed was used to identify 
Operational Taxonomic Units (OTUs). OTUs were 
calculated using a complete linkage-clustering algorithm, 
measuring the distance between the most distant members 
in each cluster, at a 99% identity cut-off. The number of 
OTUs generated was then used to determine the effect of 
taxa richness on chimera formation within a sample. 
Although numbers of reads within treatments varied this 
did not have a significant (ANOVA, P > 0.05) effect on 
chimera frequencies, number of OTUs and/or Shannon 
Index. However, all the analyses were also performed on 
a normalized data set by subsampling equal read numbers 
from each treatment and observations were found to be 
congruent with the non-normalized data (data not 
shown). 

Statistical analysis 

Species richness, or in this case OTU richness, takes no 
account of the evenness of the distribution. An index with 
better properties is the Shannon index (33). This increases 
with more taxa but also as the distribution of abundances 
across taxa becomes more even. The Shannon Index of 
biodiversity was established for each sample using Vegan 
R (34). To analyze the relationship between overall 
chimera percentage and the explanatory variables (e.g. 
phylogenetic relatedness, richness, diversity, number of 
reads), a linear model was fitted to the data, giving a 
multiplicative coefficient for each explanatory variable. 
An ANOVA was then performed to statistically determine 
which of the variables had an effect on the chimera per- 
centage. Variables without a significant effect on chimera 
percentage were removed and the model was refitted to 
give accurate ANOVA results. A probability (P-value) 
<0.05 was considered significant. 



RESULTS AND DISCUSSION 

Here, it was possible to assess the effect of OTU richness, 
evenness and phylogenetic relatedness on chimera forma- 
tion in a nSSU second-generation sequencing environmen- 
tal data set. Moreover, a rigorous experimental design 
and bioinformatic analysis facilitated the identification 
of chimera breakpoint frequencies within the parent 
nSSU molecule. The 'mock community' contained equiva- 
lent concentration of 18S rDNA genes of individual nema- 
todes of known identity and richness (GenBank Accession 
Numbers JN968213-JN968286). Nematodes are the most 
abundant phylum of meiofaunal environmental samples 
(4) representing a major part of biodiversity and 
perform numerous essential roles in ecosystems processes 
(35,36). The nematodes that were chosen included both 
phylogenetically distant and closely related species to 
emulate a likely environmental assemblage. Amplicons 
were sequenced on a Roche 454 GSFLX platform and 
generated a total of 339 515 pyrosequence reads 
(Genbank/EMBL/DDBJ short read archive accession 
number SRA043810.1). AmpliconNoise (14) generated 
236406 reads after removing errors arising from PCR 
and pyrosequencing errors and truncating sequences to a 
uniform 200 bp. Chimeras were detected in denoised 
'mock community' data using the Perseus algorithm (8) 
and ~42% of the sequences were classified as chimeric 
and were removed before taxon richness was assessed by 
clustering sequences into OTUs at a 99% identity thresh- 
old for each data set. A summary of the mean number of 
reads for each data set, denoised sequences, chimera per- 
centages and OTUs is given in Table 1. 

Denoised sequences contained between ~15% and 60% 
of chimeras in some pools, confirming that 35 cycle PCRs 
do indeed generate numerous chimeras as already con- 
firmed by previous studies (18,19,25) even within a small 
mock environmental data set. The results stress the im- 
portance of a chimera removal step to allow an accurate 
estimation of OTU numbers and robust estimates of bio- 
diversity levels in environmental samples (8,15,37). 

Overall, the mean OTU numbers were approximately 
double the number of unique nematode species in each 
pool. This could be associated not only with sequencing 
artifacts but also because organisms frequently contain 
multiple copies of heterogeneous nSSU genes (38). 
To assess the impact on the data set of multi-copy 
nSSUs, all single nematode PCR products were amplified 



Table 1. Mean numbers of OTUs, denoised sequences, chimera 
percentages and reads for the pools of close and distantly related 
nematodes with 48, 24 and 12 individuals, respectively 



Species 


Species 


OTUs 


Denoised 


Chimera 


Read 


phylogeny 


number 


at 99% 


sequences 


(%) 


number 


Close 


48 


87.6 


138.40 


35.60 


13 882.20 


Close 


24 


40.4 


63.20 


34.55 


3809.00 


Close 


12 


35.8 


42.80 


14.57 


6159.80 


Distant 


48 


63.2 


161.00 


58.98 


5657.80 


Distant 


24 


53.6 


119.00 


53.57 


10134.20 


Distant 


12 


34.4 


58.20 


39.93 


7638.20 
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with unique MID-tag sequences. Of the 74 MID-tagged 
single nematode amplifications, 61 were single copy 18S 
rDNA and 10 were double copy but all taxa were repre- 
sented by a similar total number of sequences in PCR 
reactions (data not shown). 

A striking observation was the difference in chimera 
formation between close and distantly related nematode 
assemblages. In the latter, the mean percentage of se- 
quences classified as chimeric was 55% for the 24 and 48 
species pools and was significantly higher (ANOVA, 
P< 0.001) than the equivalent pools of closely related 
nematode assemblages that had 35% chimeras. In line 
with the previous observation, the mean percentage of 
chimeras was significantly lower (ANOVA, P<0.01) in 
the 12 species pools irrespective of the similarity or 
distance of the individuals in the nematode assemblages 
(Table 1). These results suggest that chimera formation in 
the 5' region of nSSU amplicon pools is significantly 
higher in more phylogenetically diverse and richer data 
sets. 

Although the studies were on a smaller scale, Qiu et al. 
(19) and Wang and Wang (26) analyzed chimera forma- 
tion with bacterial rRNA clones and also found that PCR 
artifacts and chimera frequency increased as species diver- 
sity increased. To further confirm this, OTU diversity 
within the two nematode assemblages (close and distant 
pools) was expressed using the Shannon Index (33). OTU 
diversity (Shannon Index) and OTU richness (OTUs 
numbers) showed a significant effect (ANOVA, P<0.01, 
P< 0.001) on chimera frequency further supporting the 
hypothesis that more diverse and richer samples generate 
a higher frequency of chimeric molecules. Additionally, 
diversity (Shannon index) had a significant effect both 
on the closely (P < 0.05, P = 0.0324) and distantly 
(P<0.01, P = 0.0023) related nematode assemblages, 
and had a positive relationship with chimera frequency 
(Figure 1). 

Analysis of chimera breakpoint occurrence in nSSU 
amplicon sequences revealed that regions with higher nu- 
cleotide sequence similarity had significantly higher break- 
point frequencies (P < 0.001, P = 0.00039) (Figure 2a and 
b). Indeed, in studies with bacteria using the 16S rRNA 
gene a large number of competing templates with fairly 
high sequence similarity generated more chimeras 
(6,16,26). Presumably, one explanation for this phenom- 
enon may be the priming of strand synthesis by prema- 
turely terminated templates in the next PCR round. 

Different copies of the nSSU genes from the same 
organism may differ by up to 6.5% (18,38), and in the 
present study alignment of close and distantly related 
nematodes indicated an overall inter-specific sequence di- 
vergence of 10%, evidently enough to generate chimeras. 
To better reflect an environmental data set, in the present 
study, an alignment of 10 representatives of every phylum 
that is represented by meiofaunal taxa was performed and 
a 23% overall mean inter-specific sequence distance was 
observed for the same nSSU region. In fact, Wang and 
Wang (18) suggested that, despite some degree of nucleo- 
tide mismatching, partly terminated heterologous 16S 
rDNA templates can often be completed in the subsequent 
polymerization step resulting in chimeras. Thus, the 
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Figure 1. Chimera percentage and Shannon Index of closely and dis- 
tantly related pools of nematodes. 



possibility that formation of chimeric sequences between 
different copies of the nSSU genes was also likely to occur 
(6,16,26) is now confirmed with this experiment. It is 
probably the degree of sequence similarity within each 
individual that may determine chimera breakpoint forma- 
tion. This is an issue inherently associated with the fact 
that the nSSU gene is a multicopy gene, and intra-specific 
variability might have a determinant effect on chimera 
formation, especially when sample richness is quite high. 
Haas et al. (6) and Wang and Wang (26) verified that more 
similar 16S rDNA genes more readily form chimeras but 
they did acknowledge the possibility of chimera formation 
among more divergent species. In fact, the latter phenom- 
enon is evident for the first time in the present study, 
where chimeras are more often generated among richer, 
phylogenetically diverse samples, although the region 
where the chimera forms has to have sufficient conserva- 
tion to favor hybridization and chimera formation. Our 
alternative perspective of chimera formation in mock 
communities may be locus specific, or may be related to 
our larger sample sizes (« = 12 48) that are predicted to 
emulate more closely, true environmental samples. 

Chimeras are generally composed of two true se- 
quences, occasionally more (8), with a discrete break 
point where the transition from one sequence to another 
occurs. In the present data set, the distribution of chimera 
breakpoints showed a similar pattern across closely and 
distantly related nematode assemblages, with a mean peak 
of frequency at the first 140 bp of the selected nSSU region 
(Supplementary Figure SI). Although GC content is 
thought to correlate with chimera formation due to inef- 
ficient strand separation and susceptibility to secondary 
structure formation, a detailed analysis of the parent 
chimeric sequences at the breakpoints did not reveal a 
significant correlation between GC rich regions and 
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Figure 2. (a) Nucleotide diversity (Shannon Index) and (b) breakpoint frequencies occurrence in single nematodes and parental chimeric sequences, 
respectively. 



chimera frequencies. To further investigate the breakpoint 
region, the secondary structure of the amplified nSSU 
fragment was modeled in 12 single nematode sequences 
at 55°C and 65°C folding temperatures (Supplementary 
Table SI). Analysis of the nSSU secondary struc- 
ture showed that the regions where the breakpoints 
occurred coincided with hairpin loop structures at both 



temperatures, although at 65° C regions of secondary 
structure were less abundant (Supplementary Table SI 
and Figure 3). 

Hairpin-loops are common motifs in nSSU gene sec- 
ondary structure due to their importance in ribosome 
folding and function (39) and their presence requires 
greater energy for melting to occur during PCR and 
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their maintenance will make it more difficult for DNA 
polymerases to read through. Modeling in silico revealed 
that the nSSU amplicon region in the present study 
retained some secondary structure at the primer annealing 
temperature (55°C) and may have been one of the causes 
of premature termination of DNA synthesis. This is 
corroborated by results of multiple sequence alignments 
of PCR-induced chimeras (40) that reveal the recombinant 
regions were correlated with DNA template secondary 
structures. Fewer secondary structures of the nSSU 



amplicon were found at 65°C, suggesting that (i) primers 
should be designed with a high annealing temperature 
and/or (ii) genes chosen for environmental metagenetic 
analyses should be selected for a low tendency to second- 
ary structure formation which should reduce the dispos- 
ition of complex samples to form chimeras. Nonetheless, 
other factors are known to minimize frequency of PCR 
recombination such as adding betaine and 
dimethylsulfoxide (41); dNTPs should never be limited 
so that amplification bias is reduced and the use of 
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shorter amplicons will reduce prematurely terminated 
extension. Further to this, the choice of selectively amp- 
lifying target loci from genomic libraries or designing 
genome-specific amplification primers that will selectively 
amplify a single homologue (42) also narrow the oppor- 
tunity for chimera formation. 

The investigative use of higher annealing temperatures 
in our study was not possible since the optimal thermo- 
cycling conditions used to amplify meiofaunal representa- 
tives (9,43,44) preclude more stringent annealing 
temperatures. In fact, an intuitive general rule for metage- 
netic studies would be to avoid high annealing tempera- 
tures to ensure the co-amplification of large ranges of taxa 
from disparate phyla. This implies not only having very 
high quality DNA samples within uncontaminated labora- 
tory environments but also compulsory stringent analyses 
by using algorithms to remove artifacts and/ or putative 
chimeras after sequencing. In addition, the use of refer- 
ence databases to detect chimeric molecules in environ- 
mental data sets is complicated by their unpredictable 
diversity, meaning that reference data may not be repre- 
sentative of the true diversity. On the other hand, the ex- 
istence of chimeric sequences in public DNA databases is 
well known (24,45) and the risk of classifying chimeras as 
new organisms is becoming higher than the risk of neg- 
lecting non-chimeric ones. 

Experience from recent studies (4,9) where ~65% of the 
sequences generated from a 454 Roche environmental 
data set were discarded leads us to suggest that metage- 
netic analyses are the ideal 'breeding ground' for recom- 
binant DNA molecules. DNA amplification by PCR has 
become the main crucial step used for next-generation 
sequencing technologies in the analysis of environmental 
samples and so PCR-derived artifacts are continuously 
increasing. Based on our analyses, the theory of chimera 
formation having a stochastic distribution (46) should 
probably be re-evaluated because their occurrence can 
be influenced by several factors, namely PCR conditions, 
amplicon nucleotide diversity, molecule folding structure 
and sequencing strategies. In fact, almost all steps of the 
molecular approach can introduce biases or errors, 
including the target DNA template concentration (6,19), 
DNA polymerases (20,47) and thermal cycling conditions 
(16,19,26,48). Moreover, the interaction between nucleic 
acids, DNA polymerases and thermal cycling are likely to 
be dynamic, suggesting that the identification of optimal 
molecular biological parameters that reduce chimera for- 
mation are likely to be locus/taxon specific. 

Overall, we anticipate that our findings will enhance 
our understanding of chimera formation and associated 
optimization of pyrosequencing strategies when con- 
ducting PCR-based DNA amplification of homologous 
loci or multigene families. As technologies evolve (49), 
sequencing is likely to be employed to cle novo analyse 
gene partitions from a range of genomic loci in order to 
address questions that have been hitherto intractable using 
chain termination sequencing (50). Thus, insights gained 
here have relevance to biodiversity identification and 
associated fields such as gene-environment interactions 
in host parasite co-evolution (51); pathogen recognition 



(52); genes underpinning immune response (53) and 
predator-prey arms races (54,55). 
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